Deep Network Flow for Multi-Object Tracking
Data association problems are an important component of many computer vision
applications, with multi-object tracking being one of the most prominent
examples. A typical approach to data association involves finding a graph
matching or network flow that minimizes a sum of pairwise association costs,
which are often either hand-crafted or learned as linear functions of fixed
features. In this work, we demonstrate that it is possible to learn features
for network-flow-based data association via backpropagation, by expressing the
optimum of a smoothed network flow problem as a differentiable function of the
pairwise association costs. We apply this approach to multi-object tracking
with a network flow formulation. Our experiments demonstrate that all cost
functions for the association problem can be learned end-to-end, and that the
learned costs outperform hand-crafted ones in all settings. Integrating and
combining various input sources becomes easy, and the cost functions can be
learned entirely from data, alleviating the tedious hand-design of costs.
Comment: Accepted to CVPR 2017
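The central trick, making the optimum of a smoothed network flow differentiable
in the pairwise costs, can be sketched briefly. The snippet below is a minimal
illustration under stated assumptions, not the paper's implementation: flow
variables are relaxed to (0, 1), an entropy term smooths the objective, and the
relaxed problem is solved by unrolled gradient steps so a tracking loss can
backpropagate into a small cost network. All names, shapes, and the unrolled
solver are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PairwiseCostNet(nn.Module):
    """Hypothetical cost network: maps pairwise association features to costs."""
    def __init__(self, feat_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(),
                                 nn.Linear(32, 1))

    def forward(self, pair_feats):              # (E, feat_dim) -> (E,)
        return self.mlp(pair_feats).squeeze(-1)

def smoothed_flow(costs, n_steps=200, lr=0.5, tau=0.1, eps=1e-4):
    """Entropy-smoothed relaxation of the binary flow problem, minimized by
    unrolled gradient steps so the optimum stays differentiable in the costs.
    Flow-conservation constraints are omitted in this sketch; in the full
    problem they would enter as projections or penalties inside this loop.
    Without them the fixed point is simply sigmoid(-costs / tau)."""
    x = torch.full_like(costs, 0.5)             # relaxed flows in (0, 1)
    for _ in range(n_steps):
        # gradient of costs*x + tau*(x log x + (1-x) log(1-x)) w.r.t. x
        g = costs + tau * (torch.log(x) - torch.log(1.0 - x))
        x = (x - lr * g).clamp(eps, 1.0 - eps)
    return x

# One training step: a tracking loss backpropagates through the solver into
# the cost network (shapes and data below are toy placeholders).
cost_net = PairwiseCostNet(feat_dim=16)
opt = torch.optim.Adam(cost_net.parameters(), lr=1e-3)
pair_feats = torch.randn(50, 16)                # 50 candidate detection links
gt_flow = torch.randint(0, 2, (50,)).float()    # ground-truth associations
flow = smoothed_flow(cost_net(pair_feats))
loss = nn.functional.binary_cross_entropy(flow, gt_flow)
opt.zero_grad(); loss.backward(); opt.step()
```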
Hybrid One-Shot 3D Hand Pose Estimation by Exploiting Uncertainties
Model-based approaches to 3D hand tracking have been shown to perform well in
a wide range of scenarios. However, they require initialisation and cannot
recover easily from tracking failures that occur due to fast hand motions.
Data-driven approaches, on the other hand, can quickly deliver a solution, but
the results are often less accurate than those of model-based approaches, or
lack anatomical validity. In this work we propose
a hybrid approach for hand pose estimation from a single depth image. First, a
learned regressor is employed to deliver multiple initial hypotheses for the 3D
position of each hand joint. Subsequently, the kinematic parameters of a 3D
hand model are found by deliberately exploiting the inherent uncertainty of the
inferred joint proposals. This way, the method provides anatomically valid and
accurate solutions without requiring manual initialisation or suffering from
track losses. Quantitative results on several standard datasets demonstrate
that the proposed method outperforms state-of-the-art representatives of the
model-based, data-driven, and hybrid paradigms. Comment: BMVC 2015 (oral); see also
http://lrs.icg.tugraz.at/research/hybridhape
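The model-fitting stage can be sketched roughly as follows, with loud caveats:
the soft-min objective, the toy kinematic chain, and the use of scipy.optimize
are illustrative assumptions, not the paper's formulation. The sketch only
shows how multiple uncertain joint proposals can drive a kinematic fit that is
anatomically valid by construction.

```python
import numpy as np
from scipy.optimize import minimize

def forward_kinematics(theta, bone_len=(4.0, 3.0, 2.0)):
    """Toy stand-in for a hand model: a planar kinematic chain whose joint i
    bends by theta[i]. Returns (3, 3) joint positions; a real hand model
    would expose ~26 DoF and 20+ joints."""
    joints, pos, ang = [], np.zeros(3), 0.0
    for length, t in zip(bone_len, theta):
        ang += t
        pos = pos + length * np.array([np.cos(ang), np.sin(ang), 0.0])
        joints.append(pos)
    return np.array(joints)

def fit_hand_model(hypotheses, weights, theta0):
    """hypotheses: (J, K, 3) joint proposals from the regressor; weights:
    (J, K) proposal confidences. Kinematic parameters are fit with a soft-min
    over proposals, so each joint is pulled toward whichever hypothesis the
    model can anatomically reach."""
    def objective(theta):
        joints = forward_kinematics(theta)                            # (J, 3)
        d2 = np.sum((hypotheses - joints[:, None, :]) ** 2, axis=-1)  # (J, K)
        return -np.log(np.sum(weights * np.exp(-d2), axis=-1) + 1e-12).sum()
    return minimize(objective, theta0, method="Powell").x

# Toy usage: K=3 noisy proposals per joint around a true pose.
rng = np.random.default_rng(0)
true_joints = forward_kinematics(np.array([0.3, 0.2, -0.1]))
hyps = true_joints[:, None, :] + 0.2 * rng.standard_normal((3, 3, 3))
theta_hat = fit_hand_model(hyps, np.full((3, 3), 1 / 3), np.zeros(3))
```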
Exploring Question Decomposition for Zero-Shot VQA
Visual question answering (VQA) has traditionally been treated as a
single-step task where each question receives the same amount of effort, unlike
natural human question-answering strategies. We explore a question
decomposition strategy for VQA to overcome this limitation. We probe the
ability of recently developed large vision-language models to use human-written
decompositions and produce their own decompositions of visual questions,
finding they are capable of learning both tasks from demonstrations alone.
However, we show that naive application of model-written decompositions can
hurt performance. We introduce a model-driven selective decomposition approach
for second-guessing predictions and correcting errors, and validate its
effectiveness on eight VQA tasks across three domains, showing consistent
improvements in accuracy, including improvements of >20% on medical VQA
datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA
reformulation of the challenging Winoground task. Project Site:
https://zaidkhan.me/decomposition-0shot-vqa/ Comment: NeurIPS 2023 Camera Ready
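The selective-decomposition loop can be sketched as follows. Everything here
is an assumption for illustration: `vlm_answer` and `vlm_decompose` stand in
for calls to a vision-language model (e.g., BLIP-2) prompted with
demonstrations, and the confidence test is a placeholder for whatever
selection criterion drives the model's second-guessing.

```python
from typing import Callable

def selective_decomposition_vqa(
    image,
    question: str,
    vlm_answer: Callable,     # (image, question, context) -> (answer, confidence)
    vlm_decompose: Callable,  # (image, question) -> list of sub-questions
    threshold: float = 0.5,
):
    """Answer directly; only when the direct prediction looks unreliable,
    generate a decomposition, answer the sub-questions, and re-answer the
    original question with those QA pairs as context."""
    answer, conf = vlm_answer(image, question, context="")
    if conf >= threshold:
        return answer                     # direct answer is trusted
    # Model-written decomposition: break the question into simpler steps.
    sub_questions = vlm_decompose(image, question)
    context = ""
    for sq in sub_questions:
        sub_answer, _ = vlm_answer(image, sq, context="")
        context += f"Q: {sq} A: {sub_answer}\n"
    # Second attempt, conditioned on the sub-question answers.
    revised, _ = vlm_answer(image, question, context=context)
    return revised
```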
NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization
Monocular 3D object localization in driving scenes is a crucial task, but
challenging due to its ill-posed nature. Estimating 3D coordinates for each
pixel on the object surface holds great potential as it provides dense 2D-3D
geometric constraints for the underlying PnP problem. However, high-quality
ground truth supervision is not available in driving scenes due to sparsity and
various artifacts of Lidar data, as well as the practical infeasibility of
collecting per-instance CAD models. In this work, we present NeurOCS, a
framework that uses instance masks and 3D boxes as input to learn 3D object
shapes by means of differentiable rendering, which further serves as
supervision for learning dense object coordinates. Our approach rests on key
insights for learning a category-level shape prior directly from real driving
scenes while properly handling single-view ambiguities. Furthermore, we study
and make critical design choices to learn object coordinates more effectively
from an object-centric view. Altogether, our framework sets a new state of the
art in monocular 3D localization, ranking 1st on the KITTI-Object benchmark
among published monocular methods. Comment: Accepted to CVPR 2023
Single-Stream Multi-Level Alignment for Vision-Language Pretraining
Recent progress in large-scale vision-language pre-training has shown the
importance of aligning the visual and text modalities for downstream
vision-language tasks. Many methods use a dual-stream architecture that fuses
visual tokens and language tokens after representation learning, which aligns
only at a global level and cannot extract finer-scale semantics. In contrast,
we propose a single stream model that aligns the modalities at multiple levels:
i) instance level, ii) fine-grained patch level, iii) conceptual semantic
level. We achieve this using two novel tasks: symmetric cross-modality
reconstruction and pseudo-labeled keyword prediction. In the former, we mask
input tokens from one of the modalities and use cross-modal information to
reconstruct the masked tokens, improving fine-grained alignment between the
two modalities. In the latter, we parse the caption to select a few keywords
and feed them, together with the momentum encoder's pseudo-label signal, to
self-supervise the visual encoder, encouraging it to learn the rich semantic
concepts that are essential for grounding a textual token to an image region.
We demonstrate top performance on a set of vision-language downstream tasks
such as zero-shot/fine-tuned image/text retrieval, referring expression, and
VQA, and we show that the proposed model indeed aligns the modalities at
multiple levels. Comment: 22 pages, 7 figures
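A minimal sketch of the symmetric cross-modality reconstruction objective,
under stated assumptions: a single-stream transformer consumes concatenated
patch and word tokens, tokens of one modality are randomly masked, and the
model reconstructs them from the other modality's context. Names, dimensions,
and the mask-token mechanics are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SingleStreamReconstructor(nn.Module):
    """Toy single-stream encoder for symmetric cross-modal reconstruction."""
    def __init__(self, dim=256, vocab=30522, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mask_tok = nn.Parameter(torch.zeros(1, 1, dim))
        self.txt_head = nn.Linear(dim, vocab)  # predict masked word ids
        self.img_head = nn.Linear(dim, dim)    # regress masked patch features

    def forward(self, img_tok, txt_tok, txt_ids, mask_text=True, p=0.15):
        B, Lt, D = txt_tok.shape
        if mask_text:  # mask words, reconstruct them from visual context
            m = torch.rand(B, Lt, device=txt_tok.device) < p
            txt_in = torch.where(m[..., None],
                                 self.mask_tok.expand_as(txt_tok), txt_tok)
            h = self.encoder(torch.cat([img_tok, txt_in], dim=1))
            h = h[:, img_tok.size(1):]          # keep text positions
            return nn.functional.cross_entropy(self.txt_head(h)[m], txt_ids[m])
        Li = img_tok.size(1)  # symmetric branch: mask patches instead
        m = torch.rand(B, Li, device=img_tok.device) < p
        img_in = torch.where(m[..., None],
                             self.mask_tok.expand_as(img_tok), img_tok)
        h = self.encoder(torch.cat([img_in, txt_tok], dim=1))[:, :Li]
        return nn.functional.mse_loss(self.img_head(h)[m], img_tok[m])

# Toy usage: sum the two symmetric reconstruction losses.
model = SingleStreamReconstructor()
img_tok = torch.randn(2, 49, 256)               # e.g. 7x7 patch embeddings
txt_tok = torch.randn(2, 12, 256)               # word embeddings
txt_ids = torch.randint(0, 30522, (2, 12))      # word ids for the CE loss
loss = (model(img_tok, txt_tok, txt_ids)                      # mask text
        + model(img_tok, txt_tok, txt_ids, mask_text=False))  # mask patches
```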